CAGEF_services_slide.png

Lecture 03: Tidying your data

0.1.0 A quick intro to the Introduction to R for Data Science

This 'Introduction to R for Data Science' is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This course (CSB1020) was developed based on feedback about the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

This lesson is the third in a 6-part series. The idea is that at the end of the series, you will be able to import and manipulate your data, make exploratory plots, perform some basic statistical tests, test a regression model, and make some even prettier plots and documents to share your results.

data-science-explore.png

The structure of the class is a code-along style: it is fully hands-on. Prior to each lecture, the materials will be emailed to you and will also be available for download at QUERCUS, so you can spend more time coding than taking notes.

0.1.1 Class Objectives

At the end of this session you will know the principles of tidy data, and be able to subset and transform your data to perform simple calculations. You will be able to add new rows and columns to your data frame.

  1. Learn tidy data principles and how to convert from wide-format to long-format data.
  2. Learn how to cut and query our dataset for different information.
  3. Learn how to revert data back from long-format to wide-format.

0.2.0 How do we get there?

Today we are going to be learning about tidy data and how it makes data analysis less of a pain. We will perform some basic statistics on our newly transformed dataset. The next lesson will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots - from simple data exploration to interactive plots - this is always a fun lesson.

spotify-howtobuildmvp.gif

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

0.4.0 Data basics: The wide and long formats

0.4.1 Definitions:

0.4.2 Wide versus long format

Wide and long (sometimes un-stacked and stacked, or wide and tall, or wide and narrow) are terms used to describe how a dataset is formatted.

In wide format, variables may be listed in the first column, each forming a row of its own, while observations are presented as columns that hold the observed values for each variable.

In long format, each variable is its own column, and the results of each measured variable are recorded in rows. In data science, long format is preferred over wide format because it allows for easier and more efficient subsetting and manipulation of the data.

To read more about wide and long formats, visit https://eagereyes.org/basics/spreadsheet-thinking-vs-database-thinking.

wide_and_long_formats.png

0.5.0 Data used in this session

0.5.1 Dataset 1: Latrines

Sequencing of the V3-V5 hypervariable regions of the 16S rRNA gene

16S rRNA gene amplicon sequencing of 30 latrines from Tanzania and Vietnam at different depths (multiples of 20 cm). Microbial abundance is represented in Operational Taxonomic Units (OTUs): groups of organisms defined by a specified level of DNA sequence similarity at a marker gene (e.g. 97% similarity at the V4 hypervariable region of the 16S rRNA gene). Intrinsic environmental factors such as pH, temperature, and organic matter composition were also recorded.

0.5.1.1: taxa_pitlatrine_wide.csv

An OTU-taxa table. (Naming conventions: [Country_LatrineNo_Depth]) with sample names and environmental variables.

0.5.1.2: Data source

B Torondel, JHJ Ensink, O Gundogdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince. Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines. Microbial Biotechnology, 9(2):209-223, 2016. DOI:10.1111/1751-7915.12334

latrines_wide_to_long.png

0.5.2 Dataset 2: Microbes

This dataset is the result of 16S rRNA gene amplicon sequencing of samples from microbial communities cultured in fresh, brackish, or saline media. Treatments received the aromatic compounds toluene or pyrene as the sole source of carbon and energy. Controls did not receive any compounds (substrate-free) to account for any alternative carbon sources present in the media. The objective of this experiment was to evaluate which microorganisms would make use of toluene and pyrene.

0.5.2.1: microbes_wide.csv

Taxa table with gene sequences (ASV) and sample information. This file must be converted from wide to long format before use.

0.6.0 Packages Used in This Lesson

tidyverse (tidyr, dplyr, tibble)

gapminder

Some of these packages should already be installed in your Anaconda base environment from Lecture 02. If not, please review that lesson and load these packages. Please remember to install these packages from the conda-forge channel of Anaconda.

To install gapminder in the Anaconda prompt, use the command:

conda install -c conda-forge r-gapminder

Note that 8 different packages are loaded, and that 2 functions from the stats package have been masked by dplyr functions of the same name. You can still access the stats version of a function by calling it directly, as in stats::filter().

0.7.0 Some data preparation

In this lesson we want to answer 3 simple questions:

  1. Which latrine has the greatest mean number of OTUs?
  2. Is there more Clostridia in Tanzania or Vietnam?
  3. Which site had the greatest number of Taxa represented?

Last lesson, we learned how to filter and select data subsets we were interested in. However, we can make data manipulation more efficient by controlling the overall structure or format of our data.

making_progress.png

Let's read in our dataset, store it in a variable, and remind ourselves about the original structure.
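A minimal sketch of that step. The class file lives at data/taxa_pitlatrine_wide.csv; since that file is not bundled here, this example writes a tiny two-taxa stand-in (with the same Taxa-plus-site-columns structure) to a temporary file so it runs anywhere.

```r
library(readr)

# The real lesson file is data/taxa_pitlatrine_wide.csv; this tiny stand-in
# mimics its structure (a Taxa column plus one column per sampling site).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Taxa,Tanzania_2_9,Vietnam_9_4",
             "Acidobacteria_Gp1,0,3",
             "Clostridia,5,8"), csv_path)

data <- read_csv(csv_path)
str(data)  # a tibble: 2 rows, 3 columns
```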

1.0.0 Introduction to tidy data

Why tidy data?

Data cleaning (or dealing with 'messy' data) accounts for a huge chunk of a data scientist's time. Ultimately, we want to get our data into a 'tidy' (long) format where it is easy to manipulate, model and visualize. Having a consistent data structure, and tools that work with that structure, helps this process along.

Tidy data has:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful for unraveling its structure and getting it into a usable format.

The 5 most common problems with messy datasets are:

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables.

Fortunately there are some tools available to solve these problems.

1.1.0 Introduction to the tidyverse

The tidyverse is the universe of packages created by Hadley Wickham for data analysis. There are packages to help import, tidy, transform, model and visualize data. His packages are pretty popular, so he made a package to load all of his packages at once. This wrapper package is tidyverse. In this lesson series we have used dplyr, readr and readxl, and we will be using dplyr and tidyr today.

tidyverse1.png

Hadley has a large fan-base. Someone even made a plot of Hadley using his own package, ggplot2.

HadleyObama2.png

Back to the normalverse...


1.2.0 Assessing our data frame

Which tidy data rules might our data frame break?

At first glance we can see that the column names are actually 3 different variables: 'Country', 'LatrineNumber', and 'Depth'. This information will likely be useful in our study, as we expect different bacteria at different depths, sites, and geographical locations. Each of these is a variable and should have its own separate column.

We could keep the column names as the sample names (as they are meaningful to the researcher) and add the extra variable columns, or we could make up sample names (e.g. Sample_1) knowing that the information is not being lost, but rather stored in a more useful format.

Some of the Taxa also appear to have an additional variable of information (e.g. _Gp1), but not all taxa have this information. We could also make a separate column for this information.

Each result is the same observational unit (i.e. relative abundances of bacteria), so having one table is fine.

1.3.0 Introduction to helpful functions in tidyr

tidyr is a package with functions that help us turn our 'messy' data into 'tidy' data. It has 2 major workhorse functions and 2 other tidying functions:

  1. gather() - convert a data frame from wide to long format
  2. spread() - convert a data frame from long to wide format
  3. separate() - split a column into 2 or more columns based on a string separator
  4. unite() - merge 2 or more columns into 1 column using a string separator

gather() and spread() rely on key-value pairs to collapse or expand columns.

We've already loaded tidyverse which includes the tidyr package that the gather() function is from.

1.3.1 gather() your data from across columns

We can use the gather() function to collect our columns. This will make our dataset 'long' instead of 'wide'.

gather(
  data,
  key = "key",
  value = "value",
  ...,
  na.rm = FALSE,
  convert = FALSE,
  factor_key = FALSE
)

We need to provide gather() with information on our new columns.

  1. The first argument data is our data frame (actually an object called a tibble but close enough...)
  2. the second argument is the key, which is the name for the variable we want to gather (a set of column names). In this case our columns represent latrine sites, so we can assign the name 'Sites' to our key.
  3. The next argument is the value, which is the name for the measurements (usually numeric or integer) we have. In this case our values are Operational Taxonomic Units, so we can assign the name 'OTUs' to our value.
  4. The fourth argument is the columns that we want to gather. You can specify the columns by listing their names or positions.

Note: This function has been retired in tidyr and replaced with pivot_longer() which has many more options allowing for more user-friendly and flexible conversion. For our purposes gather() will do just fine.


First let's pull our data into a data frame.

In the above examples -1 means gather every column except the 1st, or gather every column except "Taxa". Taxa would still be retained as a column but its elements are not grouped in with 'sites' as an observation (i.e. we do not want 'Vietnam_9_4', 'Tanzania_2_9', and 'Clostridia' gathered into the same column).

Let's save the last variation into a data frame called gathered_data.
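That call can be sketched like this on a toy two-site version of the table (the column names here are assumed to match the real dataset):

```r
library(tibble)
library(tidyr)

# Toy stand-in for the latrine OTU table: Taxa plus two site columns
data <- tibble(
  Taxa         = c("Acidobacteria_Gp1", "Clostridia"),
  Tanzania_2_9 = c(0L, 5L),
  Vietnam_9_4  = c(3L, 8L)
)

# Gather every column except Taxa (-Taxa) into Sites/OTUs key-value pairs
gathered_data <- gather(data, key = "Sites", value = "OTUs", -Taxa)
gathered_data  # 4 rows x 3 columns: Taxa, Sites, OTUs
```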

1.3.1.1 What have we done using gather()?

Note how the dimensions of your dataframe have changed relative to data. Instead of 52 rows and 82 columns, we now have a data frame with 4,212 rows and 3 columns (which is the 81 columns we gathered x 52 rows). gathered_data is now in a long format instead of wide.

1.3.2 separate() can split variables into multiple columns by specifying a text delimiter

Notice the information contained in our "Site" data? It is a combination of 3 values: country, latrine number, and depth information all separated with "_" between each.

We can use the separate() function to retrieve the Country, Latrine_Number, and Depth information from our Site column. separate() takes in your dataframe, the name of the column to be split, the names of your new columns, and the character that you want to split the columns by (in this case an underscore). Note that the default is to remove your original column - you can keep it by adding the argument remove = FALSE, keeping in mind that you now have redundant data.

separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)

We need to provide separate() with information to help split our information.

  1. The first argument data is our data frame (aka tibble)
  2. the second argument is col, which is the name or position of the column we want to split.
  3. The next argument is into, which is a list of the column names we want to split from argument 2.
  4. The fourth argument, sep, tells separate() how to break up the information in each row of col.
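Putting those arguments together on a toy gathered table (the column is called Sites here, matching the key assigned during the gather step):

```r
library(tibble)
library(tidyr)

gathered_data <- tibble(
  Taxa  = c("Clostridia", "Clostridia"),
  Sites = c("Tanzania_2_9", "Vietnam_9_4"),
  OTUs  = c(5L, 8L)
)

# Split the combined site label on "_" into three new variables
split_gathered_data <- separate(gathered_data, Sites,
                                into = c("Country", "Latrine_Number", "Depth"),
                                sep = "_")
split_gathered_data  # columns: Taxa, Country, Latrine_Number, Depth, OTUs
```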

We may also want to do this for the 'Group' of Acidobacteria. Try the code, but do not save the answer in a variable.

Notice that we have triggered a warning from R that it has filled in 'NA' for the bacteria that did not have groups. Also see that I chose to split Taxa using '_Gp' since I did not need 'Gp'.
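A sketch of that split on two toy taxa, one with a group and one without, showing the NA fill and the warning it triggers:

```r
library(tibble)
library(tidyr)

taxa_tbl <- tibble(Taxa = c("Acidobacteria_Gp1", "Clostridia"))

# Splitting on "_Gp" discards the "Gp" prefix; taxa with no group get NA
# in the new Group column (with a "filled with NA" warning from tidyr)
split_taxa <- separate(taxa_tbl, Taxa, into = c("Taxa", "Group"), sep = "_Gp")
split_taxa
```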


1.3.2.1 Challenge: compare glimpse() and str() information for our split data

Use the glimpse() function to look at the type of each variable in our new data frame, split_gathered_data. Are those the types you expected? Why or why not? How is glimpse() different from the str() function?


1.3.2.2 Recall that we can group_by() to organize our dataset

There is a useful function group_by() that you can use to group variables or sets of variables together. This is useful for calculations and plotting on subsets of your data without having to turn your variables into factors.

Say I wanted to look at a combination of Country and Well Depth. While visually, you wouldn't notice any changes to your data frame, if you look at the structure it will now be a 'grouped_df'. There are 15 groupings resulting from Country and Depth. After I have performed my desired operation, I can return my data frame to its original structure by calling ungroup().

First we will examine the structure of grouped and ungrouped output without any additional operations.
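That comparison can be sketched on toy data (2 groupings here rather than the 15 in the full dataset):

```r
library(tibble)
library(dplyr)

split_gathered_data <- tibble(
  Country = c("Tanzania", "Tanzania", "Vietnam"),
  Depth   = c("9", "9", "4"),
  OTUs    = c(5L, 0L, 8L)
)

grouped <- group_by(split_gathered_data, Country, Depth)
class(grouped)           # "grouped_df" sits on top of the usual tbl_df classes
n_groups(grouped)        # 2 Country/Depth groupings in this toy table
class(ungroup(grouped))  # back to a plain tbl_df
```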

1.3.2.3 We can combine group_by() and summarize() functions to produce sensible stat data

Now we can see group_by() in action: combined with summarize() (or summarise()), it can easily calculate summary statistics for groups of data. Whereas in our messy data frame it was difficult to do calculations based on Country, Well Number or Latrine Depth, this is now an easy task. Let's get the mean, median, standard deviation and maximum value for the number of OTUs collected in Tanzania vs Vietnam.

In dealing with grouped data, we no longer have to grab a Country by subsetting or using helper functions to grab letters from their names. group_by() produces a modified data frame that recognizes we have 2 countries and will perform calculations for both of them.
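The calculation above can be sketched like this on toy numbers (the real OTU values will differ):

```r
library(tibble)
library(dplyr)

split_gathered_data <- tibble(
  Country = c("Tanzania", "Tanzania", "Vietnam", "Vietnam"),
  OTUs    = c(5L, 1L, 8L, 2L)
)

# One row of summary statistics per Country
otu_stats <- split_gathered_data %>%
  group_by(Country) %>%
  summarize(mean_OTUs   = mean(OTUs),
            median_OTUs = median(OTUs),
            sd_OTUs     = sd(OTUs),
            max_OTUs    = max(OTUs))
otu_stats
```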


2.0.0 Answering questions about our data

Now that we have tidy data, let's proceed to answering our questions:

2.1.0 Question: Which latrine has the greatest mean number of OTUs?

What steps do we need to take to answer this question? Before we dive in, let's consider what we need to answer this question.

  1. We need to group data from the same latrine together.
  2. We need the mean OTU value for each latrine group.
  3. We need to sort or identify the group with the highest mean OTU value.

Recall that our original data set had three variables mashed together into a Site (Country_LatrineNo_Depth), so answering our question would have been difficult. With the tidy data format we have split a combined variable column into three, making Latrine_Number a variable. This allowed us to simply group our data, perform our mean calculation, and get an answer.
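One way to sketch that pipeline with dplyr, on toy values and assuming the column names produced by the earlier split:

```r
library(tibble)
library(dplyr)

split_gathered_data <- tibble(
  Country        = c("Tanzania", "Tanzania", "Vietnam"),
  Latrine_Number = c("2", "2", "9"),
  OTUs           = c(5L, 1L, 8L)
)

# Group by latrine, take the mean OTUs, and sort the largest mean to the top
mean_by_latrine <- split_gathered_data %>%
  group_by(Country, Latrine_Number) %>%
  summarize(mean_OTUs = mean(OTUs), .groups = "drop") %>%
  arrange(desc(mean_OTUs))
mean_by_latrine
```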


2.2.0 Question: Is there more Clostridia in Tanzania or Vietnam?

This is a bit of variation on our first question. What kind of data do we need?

  1. A filtered data set where we only look at Clostridia entries.
  2. We need to group based on country.
  3. A sorted summary of the OTU data.

Again, being able to filter by Taxa and group by Country (as an isolated variable) helps a lot. With dplyr syntax we can perform all data manipulations and calculations in a code block that is readable. Note that we grouped by both Country and Taxa. This was done more for visual purposes to confirm that Clostridia was the only taxa coming out of our filter. We could easily have done group_by(Country) with the same result. Our final output, however, would not include the Taxa column.
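A sketch of that filter-group-summarize chain on toy data:

```r
library(tibble)
library(dplyr)

split_gathered_data <- tibble(
  Taxa    = c("Clostridia", "Clostridia", "Bacilli"),
  Country = c("Tanzania", "Vietnam", "Vietnam"),
  OTUs    = c(5L, 8L, 2L)
)

# Keep only Clostridia rows, then total the OTUs per country
clostridia_totals <- split_gathered_data %>%
  filter(Taxa == "Clostridia") %>%
  group_by(Country, Taxa) %>%
  summarize(total_OTUs = sum(OTUs), .groups = "drop") %>%
  arrange(desc(total_OTUs))
clostridia_totals
```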


2.3.0 Question: Which site had the greatest number of Taxa represented?

Lets break down the question again into the basic components!

  1. A sampling 'site' is unique to a Country, Latrine Number, and Depth, so we will need to group using these variables
  2. Exclude any taxa that weren't observed at a site
  3. Use the n() function from dplyr to count the rows in each group
  4. Sort the data to find the site with the most taxa

Since we can group by the 3 variables that were in the Site name, there is no disadvantage to having our data in tidy format compared to our original wide data frame. However now we are able to filter for non-zero OTUs, which was impossible in the wide format. Since we know from earlier in the lesson that each Taxa is only represented once for each site, we only have to count and order the number of observations to get our answer.
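The steps above can be sketched as follows on toy data (note the zero-OTU row being filtered out before counting):

```r
library(tibble)
library(dplyr)

split_gathered_data <- tibble(
  Taxa           = c("Clostridia", "Bacilli", "Clostridia", "Bacilli"),
  Country        = c("Tanzania", "Tanzania", "Vietnam", "Vietnam"),
  Latrine_Number = c("2", "2", "9", "9"),
  Depth          = c("9", "9", "4", "4"),
  OTUs           = c(5L, 2L, 8L, 0L)
)

# Drop unobserved taxa, then count the remaining taxa at each site
taxa_per_site <- split_gathered_data %>%
  filter(OTUs > 0) %>%
  group_by(Country, Latrine_Number, Depth) %>%
  summarize(n_taxa = n(), .groups = "drop") %>%
  arrange(desc(n_taxa))
taxa_per_site
```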


3.0.0 Getting back to the way we were

To get data back into its original format, there are reciprocal functions in the tidyr package, making it possible to switch between wide and long formats.

Fair question: But you've just been telling me how great the 'long' format is?!?! Why would I want the wide format again???

Honest answer: Note that our original data frame was 52 rows and expanded to 4212 rows in the long format. When you have, say, a genomics dataset you might end up with 6,000 rows expanding to 600,000 rows. You probably want to do your calculations and switch back to the more "human writeable/readable" format. Sure, I can save a data frame with 600,000 rows, but I can't really send it to anyone because spreadsheet software such as Excel might crash while trying to open the file.

3.1.0 unite() your columns back again

unite(
  data, 
  col, 
  ..., 
  sep = "_", 
  remove = TRUE, 
  na.rm = FALSE
)

As the opposite of separate(), unite() needs information to help consolidate our data.

  1. The first argument, data, is our data frame (aka tibble)
  2. the second argument, col, is the name of the column where we want to keep our combined data.
  3. The next argument, "...", is a list of the column names we want to join into argument 2.
  4. The fourth argument, sep, tells unite() what kind of character to put between recombined data values going into col.

3.1.1 Challenge

Collapse Country, Latrine_Number and Depth back into one variable, 'Site', using the unite() function. Store the output in a data frame called united_data.
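One possible solution sketch, assuming the column names produced by the earlier separate() step:

```r
library(tibble)
library(tidyr)

split_gathered_data <- tibble(
  Taxa           = "Clostridia",
  Country        = "Tanzania",
  Latrine_Number = "2",
  Depth          = "9",
  OTUs           = 5L
)

# Rejoin the three columns into one Site column, separated by "_"
united_data <- unite(split_gathered_data, Site,
                     Country, Latrine_Number, Depth, sep = "_")
united_data  # columns: Taxa, Site, OTUs
```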

3.2.0 Use spread() to convert your data from long to wide format

spread(
  data, 
  key, 
  value, 
  fill = NA, 
  convert = FALSE, 
  drop = TRUE, 
  sep = NULL
)

As the opposite of gather(), spread() needs information to help reshape our data. This can be tricky to conceptualize, BUT the goal is to consolidate row entries based on specific columns that we do NOT name.

  1. The first argument, data, is our data frame (aka tibble)
  2. the second argument, key, is the name of the column we will use to re-categorize or group our data into new columns. The different values found in the key column become the names of the new columns, while the remaining columns with unique value combinations remain as rows.
  3. The next argument, value, is the column names we will use to re-populate our new columns.

Note: Much like gather(), the spread() function has been retired in tidyr and replaced with pivot_wider() which has many more options allowing for more user-friendly and flexible conversion. For our purposes spread() will do just fine.

3.2.1 Revert our long format dataset back to wide

Use the spread() function to turn united_data into the wide shape of our original dataset. Save the output into a data frame called spread_data.
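That reversal can be sketched on a toy united table:

```r
library(tibble)
library(tidyr)

united_data <- tibble(
  Taxa = c("Clostridia", "Clostridia"),
  Site = c("Tanzania_2_9", "Vietnam_9_4"),
  OTUs = c(5L, 8L)
)

# Each value in the Site key column becomes its own column again
spread_data <- spread(united_data, key = Site, value = OTUs)
spread_data  # 1 row: Taxa, Tanzania_2_9, Vietnam_9_4
```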

3.2.2 Save our data frame to a text file

We'll use the standard write.csv command to save our results to our data folder.
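A sketch of the save step. In class the target would be something like data/spread_data.csv (a hypothetical path); here a temporary file keeps the example self-contained:

```r
# Hypothetical output path; in class you would write into your data/ folder
out_path <- tempfile(fileext = ".csv")
spread_data <- data.frame(Taxa = "Clostridia", Tanzania_2_9 = 5, Vietnam_9_4 = 8)
write.csv(spread_data, out_path, row.names = FALSE)
file.exists(out_path)  # TRUE
```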


4.0.0 Example challenges for working with tidy data

4.1.0 Filter data and subset it

  1. Filter united_data to remove all rows with zero values, and store the result as united_data_no_zeros. How many rows in united_data had a value of zero?
  2. Sort united_data_no_zeros to keep the top 20 rows with the highest OTUs.

4.2.0 Convert microbes_wide.csv to long format

Reshape microbes_wide.csv into a long format version called microbes_long. Make sure the columns are in the following order: "ASV", "salinity", "compound", "group", "replicate", "kingdom", "phylum", "class", "order", "family", "genus", and "abundance".

Once you're done, write microbes_long as microbes.csv.

Notes about microbes_wide.csv

  1. Each observation name is a combination of the salinity, compound, group, and replicate
  2. Each taxa is a combination of kingdom, phylum, class, order, family, and genus
  3. Measured values for each observation are overall abundance of the specific microbe.

4.3.0 Convert gapminder_wide to a long format

Read in the gapminder_wide.csv. What rules of tidy data does it break? Transform the dataset to the format below.

continent country year lifeExp pop gdpPercap
Asia Afghanistan 1952 28.801 8425333 779.4453
Asia Afghanistan 1957 30.332 9240934 820.8530
Asia Afghanistan 1962 31.997 10267083 853.1007
Asia Afghanistan 1967 34.020 11537966 836.1971
Asia Afghanistan 1972 36.088 13079460 739.9811
Asia Afghanistan 1977 38.438 14880372 786.1134
  1. How many rows do you have?
  2. Save the newly reshaped document as gapminder_long

Reshape gapminder_wide into a long, cleansed format. BEFORE you start writing any code, IDENTIFY the formatting deficiencies and then PLAN ahead what you want to achieve.

4.3.1 Issues with gapminder_wide

Observations appear to be split into two visible categories based on continent and country

  1. gdpPercap_year - the GDP per capita for a specific year in that continent/country combination
  2. lifeExp_year - the life expectancy for a specific year in that continent/country combination
  3. pop_year - the population for a specific year in that continent/country combination

So we are seeing multiple observations per column AND multiple observation types in the table (GDP per capita, life expectancy and population)! What a mess!!

4.3.2 Break up a dual-information variable

We've generated a long format table, but we still have obs_type as a variable, which traps two kinds of information:

  1. The observation type (gdpPercap, lifeExp, and pop).
  2. The year of that observation.

We need to break that up into separate variables.

4.3.3 Use spread() to widen your table based on a mixed category observation

We've successfully separated our dual-information variable into obs_type and year. However, the three observation types are still trapped in the same table and column! We'll have to take a small step back before we can take a step forward: send it back into a slightly wider format!

4.3.4 Use select() to rearrange your columns!

Nearly there! Let's review what we want as a final data table.

continent country year lifeExp pop gdpPercap
Asia Afghanistan 1952 28.801 8425333 779.4453
Asia Afghanistan 1957 30.332 9240934 820.8530
Asia Afghanistan 1962 31.997 10267083 853.1007
Asia Afghanistan 1967 34.020 11537966 836.1971
Asia Afghanistan 1972 36.088 13079460 739.9811
Asia Afghanistan 1977 38.438 14880372 786.1134

Good job!


5.0.0 Resources

https://github.com/wmhall/tidyr_lesson/blob/master/tidyr_lesson.md
http://vita.had.co.nz/papers/tidy-data.pdf
https://thinkr.fr/tidyverse-hadleyverse/
http://stat545.com/bit001_dplyr-cheatsheet.html
http://dplyr.tidyverse.org/articles/two-table.html

5.1.0 Post-Lesson Assessment

Soon after the end of this lecture, a homework assignment will be available for you on DataCamp. You will have until 1200 hours on Thursday, March 11th, to submit your assignment (right before the next lecture). This is a pass-fail assignment where you need to achieve at least 75% of the total points possible to pass.

5.2.0 Preparation for next time

Please install the following packages from the conda-forge channel for next time:

tidyverse (ggplot2, tidyr, dplyr, forcats) - these should already be installed from a previous installation of tidyverse
RColorBrewer
viridis
gridExtra
ggrepel
ggbeeswarm
ggthemes
ggpubr
UpSetR


Thanks for coming!